audio codec


Adapting Neural Audio Codecs to EEG

Kastrati, Ard, Lanzendörfer, Luca, Rigoni, Riccardo, Matilla, John Staib, Wattenhofer, Roger

arXiv.org Artificial Intelligence

EEG and audio are inherently distinct modalities, differing in sampling rate, channel structure, and scale. Yet, we show that pretrained neural audio codecs can serve as effective starting points for EEG compression, provided that the data are preprocessed to fit the codec's input constraints. Using DAC, a state-of-the-art neural audio codec, as our base, we demonstrate that raw EEG can be mapped into the codec's stride-based framing, enabling direct reuse of the audio-pretrained encoder-decoder. Even without modification, this setup yields stable EEG reconstructions, and fine-tuning on EEG data further improves fidelity and generalization compared to training from scratch. We systematically explore compression-quality trade-offs by varying residual codebook depth, codebook (vocabulary) size, and input sampling rate. To capture spatial dependencies across electrodes, we propose DAC-MC, a multi-channel extension with attention-based cross-channel aggregation and channel-specific decoding, while retaining the audio-pretrained initialization. Evaluations on the TUH Abnormal and Epilepsy datasets show that the adapted codecs preserve clinically relevant information, as reflected in spectrogram-based reconstruction loss and downstream classification accuracy.
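The "stride-based framing" idea above amounts to padding each EEG trace so its length is an exact multiple of the codec encoder's hop size, so the pretrained encoder sees whole frames. A minimal sketch, assuming a hypothetical stride of 512 samples (DAC's actual stride depends on the model variant):

```python
import numpy as np

def frame_for_codec(eeg, stride=512):
    """Zero-pad a 1-D EEG trace so its length is a multiple of the
    codec's encoder stride, yielding whole codec frames.
    `stride=512` is an assumed hop size, not DAC's documented value."""
    pad = (-len(eeg)) % stride
    framed = np.pad(eeg, (0, pad))
    return framed, framed.size // stride  # padded signal, frame count

x = np.random.randn(10_000).astype(np.float32)  # mock single-channel EEG
padded, n_frames = frame_for_codec(x)
print(padded.size, n_frames)  # 10240 20
```

Each channel can be framed this way independently; the paper's DAC-MC extension then aggregates across channels with attention rather than treating them as separate mono streams.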


ADNAC: Audio Denoiser using Neural Audio Codec

Jimon, Daniel, Vaida, Mircea, Stan, Adriana

arXiv.org Artificial Intelligence

Audio denoising is critical in signal processing, enhancing intelligibility and fidelity for applications like restoring musical recordings. This paper presents a proof-of-concept for adapting a state-of-the-art neural audio codec, the Descript Audio Codec (DAC), for music denoising. This work overcomes the limitations of traditional architectures like U-Nets by training the model on a large-scale, custom-synthesized dataset built from diverse sources. Training is guided by a multi-objective loss function that combines time-domain, spectral, and signal-level fidelity metrics. Ultimately, this paper aims to present a PoC for high-fidelity, generative audio restoration. Noise reduction is a fundamental part of audio signal processing, substantially improving signal quality and intelligibility across domains like speech processing [1-3], music production and restoration [1], and bioacoustics analysis [2].


U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation

Yang, Xusheng, Zhou, Long, Wang, Wenfu, Hu, Kai, Feng, Shulin, Li, Chenxing, Yu, Meng, Yu, Dong, Zou, Yuexian

arXiv.org Artificial Intelligence

We propose \textbf{U-Codec}, an \textbf{U}ltra low frame-rate neural speech \textbf{Codec} that achieves high-fidelity reconstruction and fast speech generation at an extremely low frame-rate of 5Hz (5 frames per second). Because extreme compression at 5Hz typically leads to severe loss of intelligibility and spectral detail, we introduce a Transformer-based inter-frame long-term dependency module and systematically explore residual vector quantization (RVQ) depth and codebook size to identify optimal configurations. Moreover, we apply U-Codec to a large language model (LLM)-based auto-regressive TTS model, which leverages a global and local hierarchical architecture to effectively capture dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that U-Codec improves LLM-based TTS inference speed by around 3 $\times$ over high-frame-rate codecs while maintaining similarity and naturalness. These results validate the feasibility of using highly compressed 5Hz discrete tokens for fast and high-fidelity speech synthesis.
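The frame-rate/RVQ-depth trade-off above follows from simple arithmetic: an RVQ token stream costs frame_rate x layers x log2(codebook_size) bits per second, so lowering the frame rate 10x while deepening the stack roughly preserves bitrate but cuts autoregressive steps. A sketch, assuming a codebook size of 1024 (the paper's actual vocabulary sizes are among the configurations it explores):

```python
import math

def codec_bitrate(frame_rate_hz, rvq_layers, codebook_size):
    """Bits per second of an RVQ token stream: each frame emits one
    index per layer, at log2(codebook_size) bits per index."""
    return frame_rate_hz * rvq_layers * math.log2(codebook_size)

# 3-layer RVQ at 50 Hz vs. 32-layer RVQ at 5 Hz, codebook size 1024 assumed
hi_rate = codec_bitrate(50, 3, 1024)   # 1500.0 bps
lo_rate = codec_bitrate(5, 32, 1024)   # 1600.0 bps
print(hi_rate, lo_rate)
```

At comparable bitrates, the 5Hz stream requires 10x fewer autoregressive positions per second of audio, which is where the reported inference speedup comes from.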


Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning

Zhao, Junchuan, Wang, Xintong, Wang, Ye

arXiv.org Artificial Intelligence

Recent advances in discrete audio codecs have significantly improved speech representation modeling, while codec language models have enabled in-context learning for zero-shot speech synthesis. Inspired by this, we propose a voice conversion (VC) model within the VALL-E X framework, leveraging its strong in-context learning capabilities for speaker adaptation. To enhance prosody control, we introduce a prosody-aware audio codec encoder (PACE) module, which isolates and refines prosody from other sources, improving expressiveness and control. By integrating PACE into our VC model, we achieve greater flexibility in prosody manipulation while preserving speaker timbre. Experimental results demonstrate that our approach outperforms baseline VC systems in prosody preservation, timbre consistency, and overall naturalness.


Spectrogram Patch Codec: A 2D Block-Quantized VQ-VAE and HiFi-GAN for Neural Speech Coding

Chary, Luis Felipe, Ramirez, Miguel Arjona

arXiv.org Artificial Intelligence

We present a neural speech codec that challenges the need for complex residual vector quantization (RVQ) stacks by introducing a simpler, single-stage quantization approach. Our method operates directly on the mel-spectrogram, treating it as 2D data and quantizing non-overlapping 4x4 patches into a single, shared codebook. This patchwise design simplifies the architecture, enables low-latency streaming, and yields a discrete latent grid. To ensure high-fidelity synthesis, we employ late-stage adversarial fine-tuning for the VQ-VAE and train a HiFi-GAN vocoder from scratch on the codec's reconstructed spectrograms. Operating at approximately 7.5 kbits/s for 16 kHz speech, our system was evaluated against several state-of-the-art neural codecs using objective metrics such as STOI, PESQ, MCD, and ViSQOL. The results demonstrate that our simplified, non-residual architecture achieves competitive perceptual quality and intelligibility, validating it as an effective and open foundation for future low-latency codec designs.
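The patchwise quantization described above can be sketched as: tile the (freq, time) mel-spectrogram into 4x4 blocks and snap each block to its nearest entry in one shared codebook. A minimal NumPy illustration with assumed dimensions (80 mel bins, 512 codes); the paper's trained codebook is learned end-to-end, not random:

```python
import numpy as np

def quantize_patches(mel, codebook, p=4):
    """Split a (freq, time) mel-spectrogram into non-overlapping
    p x p patches and assign each to its nearest codebook vector
    by Euclidean distance. Returns a 2-D grid of code indices.
    Both dimensions of `mel` must be divisible by p."""
    F, T = mel.shape
    patches = (mel.reshape(F // p, p, T // p, p)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, p * p))            # (n_patches, 16)
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(1).reshape(F // p, T // p)    # discrete latent grid

rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 16))        # 80 mel bins, 16 frames
codebook = rng.standard_normal((512, 16))  # 512 shared 4x4 codes
codes = quantize_patches(mel, codebook)
print(codes.shape)  # (20, 4)
```

Because every patch draws from the same codebook, the decoder's discrete input is a single index grid rather than a per-layer residual stack.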


L3AC: Towards a Lightweight and Lossless Audio Codec

Zhai, Linwei, Ding, Han, Zhao, Cui, Wang, Fei, Wang, Ge, Zhi, Wang, Xi, Wei

arXiv.org Artificial Intelligence

Neural audio codecs have recently gained traction for their ability to compress high-fidelity audio and provide discrete tokens for generative modeling. However, leading approaches often rely on resource-intensive models and complex multi-quantizer architectures, limiting their practicality in real-world applications. In this work, we introduce L3AC, a lightweight neural audio codec that addresses these challenges by leveraging a single quantizer and a highly efficient architecture. To enhance reconstruction fidelity while minimizing model complexity, L3AC explores streamlined convolutional networks and local Transformer modules, alongside TConv--a novel structure designed to capture acoustic variations across multiple temporal scales. Despite its compact design, extensive experiments across diverse datasets demonstrate that L3AC matches or exceeds the reconstruction quality of leading codecs while reducing computational overhead by an order of magnitude. The single-quantizer design further enhances its adaptability for downstream tasks. The source code is publicly available at https://github.com/zhai-lw/L3AC.


NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference

Casanova, Edresson, Neekhara, Paarth, Langman, Ryan, Hussain, Shehzeen, Ghosh, Subhankar, Yang, Xuesong, Jukić, Ante, Li, Jason, Ginsburg, Boris

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens, enabling the application of language modeling techniques to speech data. However, existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. To address this, there is growing interest in low frame-rate audio codecs, which reduce the number of autoregressive steps required to generate one second of audio. In this paper, we conduct ablation studies to examine the impact of frame rate, bitrate, and causality on codec reconstruction quality. Based on our findings, we introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS). NanoCodec outperforms related works across various bitrate ranges, establishing a new benchmark for low-latency and efficient Speech LLM training and inference.


SpectroStream: A Versatile Neural Codec for General Audio

Li, Yunpeng, Han, Kehang, McWilliams, Brian, Borsos, Zalan, Tagliasacchi, Marco

arXiv.org Artificial Intelligence

We propose SpectroStream, a full-band multi-channel neural audio codec. Successor to the well-established SoundStream, SpectroStream extends its capability beyond 24 kHz monophonic audio and enables high-quality reconstruction of 48 kHz stereo music at bit rates of 4--16 kbps. This is accomplished with a new neural architecture that leverages audio representation in the time-frequency domain, which leads to better audio quality especially at higher sample rate. The model also uses a delayed-fusion strategy to handle multi-channel audio, which is crucial in balancing per-channel acoustic quality and cross-channel phase consistency.


Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations

Han, Yichen, Hao, Xiaoyang, Chen, Keming, Xiong, Weibo, He, Jun, Zhang, Ruonan, Cao, Junjie, Liu, Yue, Li, Bowen, Zhang, Dongrui, Xia, Hui, Fu, Huilei, Jia, Kai, Guo, Kaixuan, Jin, Mingli, Meng, Qingyun, Ma, Ruidong, Fang, Ruiqian, Guo, Shaotong, Li, Xuhui, Xiang, Yang, Zhang, Ying, Liu, Yulong, Li, Yunfeng, Zhang, Yuyi, Zhou, Yuze, Wang, Zhen, Chen, Zhaowen

arXiv.org Artificial Intelligence

Text-to-speech (TTS) synthesis has seen renewed progress under the discrete modeling paradigm. Existing autoregressive approaches often rely on single-codebook representations, which suffer from significant information loss. Even with post-hoc refinement techniques such as flow matching, these methods fail to recover fine-grained details (e.g., prosodic nuances, speaker-specific timbres), especially in challenging scenarios like singing voice or music synthesis. We propose QTTS, a novel TTS framework built upon our new audio codec, QDAC. The core innovation of QDAC lies in its end-to-end training of an ASR-based auto-regressive network with a GAN, which achieves superior semantic feature disentanglement for scalable, near-lossless compression. QTTS models these discrete codes using two innovative strategies: the Hierarchical Parallel architecture, which uses a dual-AR structure to model inter-codebook dependencies for higher-quality synthesis, and the Delay Multihead approach, which employs parallelized prediction with a fixed delay to accelerate inference speed. Our experiments demonstrate that the proposed framework achieves higher synthesis quality and better preserves expressive content compared to baselines. This suggests that scaling up compression via multi-codebook modeling is a promising direction for high-fidelity, general-purpose speech and audio generation.
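The "parallelized prediction with a fixed delay" strategy is in the family of delay patterns used for multi-codebook token grids: layer k is shifted right by k steps so all layers of a position can be predicted in one forward pass, while deeper layers still condition on shallower layers from earlier steps. A sketch of the generic pattern (the paper's exact Delay Multihead scheme may differ in details):

```python
def apply_delay_pattern(codes, pad=-1):
    """Shift layer k of a multi-layer token grid right by k steps.
    `codes` is a list of L layers, each a list of T token ids;
    returns an L x (T + L - 1) grid padded with `pad`."""
    L, T = len(codes), len(codes[0])
    out = [[pad] * (T + L - 1) for _ in range(L)]
    for k, layer in enumerate(codes):
        for t, tok in enumerate(layer):
            out[k][t + k] = tok
    return out

grid = [[1, 2, 3],   # RVQ layer 0, 3 frames
        [4, 5, 6]]   # RVQ layer 1
print(apply_delay_pattern(grid))  # [[1, 2, 3, -1], [-1, 4, 5, 6]]
```

The sequence grows by only L-1 steps, so inference stays close to single-layer autoregressive cost while preserving inter-layer ordering.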


MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation

Song, Yakun, Chen, Jiawei, Zhuang, Xiaobin, Du, Chenpeng, Ma, Ziyang, Wu, Jian, Cong, Jian, Jia, Dongya, Chen, Zhuo, Wang, Yuping, Wang, Yuxuan, Chen, Xie

arXiv.org Artificial Intelligence

Neural audio codecs have made significant strides in efficiently mapping raw audio waveforms into discrete token representations, which are foundational for contemporary audio generative models. However, most existing codecs are optimized primarily for reconstruction quality, often at the expense of the downstream modelability of the encoded tokens. Motivated by the need to overcome this bottleneck, we introduce $\textbf{MagiCodec}$, a novel single-layer, streaming Transformer-based audio codec. MagiCodec is designed with a multistage training pipeline that incorporates Gaussian noise injection and latent regularization, explicitly targeting the enhancement of semantic expressiveness in the generated codes while preserving high reconstruction fidelity. We analytically derive the effect of noise injection in the frequency domain, demonstrating its efficacy in attenuating high-frequency components and fostering robust tokenization. Extensive experimental evaluations show that MagiCodec surpasses state-of-the-art codecs in both reconstruction quality and downstream tasks. Notably, the tokens produced by MagiCodec exhibit Zipf-like distributions, as observed in natural languages, thereby improving compatibility with language-model-based generative architectures. The code and pre-trained models are available at https://github.com/Ereboas/MagiCodec.
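The noise-injection idea above can be sketched in a few lines: perturb the pre-quantization latent with Gaussian noise during training only, forcing the codes to be robust to the perturbation (which, as the paper argues analytically, attenuates high-frequency components). The noise scale and latent shape here are illustrative assumptions, not MagiCodec's configuration:

```python
import numpy as np

def noisy_latent(z, sigma=0.1, training=True, seed=None):
    """Gaussian noise injection on a latent: applied only during
    training so the downstream quantizer must produce codes that
    are stable under the perturbation. `sigma` is a hypothetical
    noise scale."""
    if not training:
        return z
    rng = np.random.default_rng(seed)
    return z + sigma * rng.standard_normal(z.shape)

z = np.zeros((4, 8))                 # mock latent: 4 frames x 8 dims
z_train = noisy_latent(z, seed=0)    # perturbed during training
z_eval = noisy_latent(z, training=False)  # identity at inference
print(np.allclose(z_eval, z))  # True
```

At inference the injection is disabled, so reconstruction fidelity is unaffected; the regularization acts purely on what the tokens learn to encode.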